1 New York Stock Exchange (NYSE) data (1962-1986) (140 pts)
The NYSE.csv file contains three daily time series from the New York Stock Exchange (NYSE) for the period Dec 3, 1962-Dec 31, 1986 (6,051 trading days).
Log trading volume (\(v_t\)): This is the fraction of all outstanding shares that are traded on that day, relative to a 100-day moving average of past turnover, on the log scale.
Dow Jones return (\(r_t\)): This is the difference between the log of the Dow Jones Industrial Index on consecutive trading days.
Log volatility (\(z_t\)): This is based on the absolute values of daily price movements.
# Read in NYSE data from url
url = "https://raw.githubusercontent.com/ucla-biostat-212a/2025winter/master/slides/data/NYSE.csv"
NYSE <- read_csv(url)
NYSE
The autocorrelation at lag \(\ell\) is the correlation of all pairs \((v_t, v_{t-\ell})\) that are \(\ell\) trading days apart. These sizable correlations give us confidence that past values will be helpful in predicting the future.
ggacf(NYSE$log_volume) + ggthemes::theme_few()
Figure 1: The autocorrelation function for log volume. We see that nearby values are fairly strongly correlated, with correlations above 0.2 as far as 20 days apart.
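As a minimal sketch of the definition above, the lag-\(\ell\) correlation can be computed directly as the correlation between a series and a shifted copy of itself (or of another series). The series `v` below is simulated (an AR(1) process standing in for log volume, not the NYSE data), and `lag_cor` is a hypothetical helper introduced only for illustration.

```r
# Simulated stand-in for log volume: an AR(1) series (assumption, not NYSE data)
set.seed(1)
v <- as.numeric(arima.sim(model = list(ar = 0.6), n = 2000))

# Hypothetical helper: correlation of all pairs (x_t, y_{t-l}) that are l steps apart
lag_cor <- function(x, y, l) {
  n <- length(x)
  cor(x[(l + 1):n], y[1:(n - l)])
}

# Autocorrelation of v at lags 1..5 (pass a different y for a cross-correlation)
auto <- sapply(1:5, function(l) lag_cor(v, v, l))
round(auto, 2)
```

Passing a second series as `y` gives the correlation between \(v_t\) and the lagged values of that series.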
Do a similar plot for (1) the correlation between \(v_t\) and the lag-\(\ell\) Dow Jones return \(r_{t-\ell}\) and (2) the correlation between \(v_t\) and the lag-\(\ell\) log volatility \(z_{t-\ell}\).
# Correlation between log volume and DJ return lagged by 1 to 30 trading days
seq(1, 30) %>%
  map(function(x) {
    cor(NYSE$log_volume, lag(NYSE$DJ_return, x), use = "pairwise.complete.obs")
  }) %>%
  unlist() %>%
  tibble(lag = 1:30, cor = .) %>%
  ggplot(aes(x = lag, y = cor)) +
  geom_hline(aes(yintercept = 0)) +
  geom_segment(mapping = aes(xend = lag, yend = 0)) +
  ggtitle("Cross-correlation between `log volume` and lagged `DJ return`")
# Correlation between log volume and log volatility lagged by 1 to 30 trading days
seq(1, 30) %>%
  map(function(x) {
    cor(NYSE$log_volume, lag(NYSE$log_volatility, x), use = "pairwise.complete.obs")
  }) %>%
  unlist() %>%
  tibble(lag = 1:30, cor = .) %>%
  ggplot(aes(x = lag, y = cor)) +
  geom_hline(aes(yintercept = 0)) +
  geom_segment(mapping = aes(xend = lag, yend = 0)) +
  ggtitle("Cross-correlation between `log volume` and lagged `log volatility`")
1.1 Project goal
Our goal is to forecast daily Log trading volume, using various machine learning algorithms we learnt in this class.
The data set is already split into train (before Jan 1st, 1980, \(n_{\text{train}} = 4,281\)) and test (after Jan 1st, 1980, \(n_{\text{test}} = 1,770\)) sets.
In general, we would tune the lag \(L\) to achieve the best forecasting performance. In this project, we fix \(L=5\). That is, we always use the previous five trading days’ data to forecast today’s log trading volume.
Pay attention to the nuance of splitting time series data for cross validation. Study and use the time-series functionality in tidymodels. Make sure to use the same splits when tuning different machine learning algorithms.
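The nuance is that validation rows must always come after the rows used for fitting, so ordinary random folds are not appropriate. A minimal base-R sketch of expanding-window splits follows. The helper `make_rolling_splits` and the sizes are hypothetical and chosen only for illustration. In tidymodels the rsample package provides this kind of resampling (for example `rolling_origin()`), and reusing one split object across all models keeps the comparisons fair.

```r
# Hypothetical helper: expanding-window splits over n time-ordered rows.
# Each split fits on rows 1..end and assesses on the next `assess` rows.
make_rolling_splits <- function(n, initial, assess) {
  ends <- seq(initial, n - assess, by = assess)
  lapply(ends, function(end) {
    list(analysis = 1:end, assessment = (end + 1):(end + assess))
  })
}

# Illustrative sizes only (not the settings used in this assignment)
splits <- make_rolling_splits(n = 1000, initial = 500, assess = 100)
length(splits)
range(splits[[1]]$assessment)
```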
Use the \(R^2\) between forecast and actual values as the cross validation and test evaluation criterion.
1.2 Baseline method (20 pts)
We use the straw man (use yesterday’s value of log trading volume to predict that of today) as the baseline method. Evaluate the \(R^2\) of this method on the test data.
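The straw man can be sketched on a simulated series as below. The data are an assumption (an AR(1) process, not the NYSE series). Two common definitions of \(R^2\) are shown because they can differ for a forecast that is not a least-squares fit: the squared correlation, which is what `yardstick::rsq()` reports, and the traditional \(1 - \text{SSE}/\text{SST}\), which is what `yardstick::rsq_trad()` reports.

```r
# Simulated stand-in for log volume (assumption: AR(1), not the NYSE series)
set.seed(2)
v <- as.numeric(arima.sim(model = list(ar = 0.7), n = 3000))

# Straw man: yesterday's value is the forecast for today
actual   <- v[-1]
forecast <- v[-length(v)]

# Squared correlation between forecast and actual values
rsq_cor <- cor(actual, forecast)^2

# Traditional R^2 = 1 - SSE / SST
rsq_trad <- 1 - sum((actual - forecast)^2) / sum((actual - mean(actual))^2)

c(rsq_cor = rsq_cor, rsq_trad = rsq_trad)
```

Whichever definition is chosen, it should be used consistently for every method being compared.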
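L = 5
# Add lag-1 to lag-L copies of each of the three series as new columns
for (i in seq(1, L)) {
  NYSE = NYSE %>%
    mutate(
      !!paste("DJ_return_lag", i, sep = "") := lag(NYSE$DJ_return, i),
      !!paste("log_volume_lag", i, sep = "") := lag(NYSE$log_volume, i),
      !!paste("log_volatility_lag", i, sep = "") := lag(NYSE$log_volatility, i)
    )
}
# Drop the first L rows, which have missing lagged values
NYSE = NYSE %>% na.omit()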
Fit an ordinary least squares (OLS) regression of the response vector \(y\) (the values \(v_t\)) on the lag matrix \(M\) (the previous \(L\) values of the series), giving \[
\hat v_t = \hat \beta_0 + \hat \beta_1 v_{t-1} + \hat \beta_2 v_{t-2} + \cdots + \hat \beta_L v_{t-L},
\] known as an order-\(L\) autoregression model or AR(\(L\)).
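A minimal sketch of this regression on simulated data, using base R only, is given here. The series is an assumption (a simulated AR(2) process standing in for log volume), and the names `y` and `M` simply mirror the notation above.

```r
# Simulated AR(2) series standing in for log volume (assumption)
set.seed(3)
v <- as.numeric(arima.sim(model = list(ar = c(0.5, 0.2)), n = 5000))

L <- 5
# embed() returns rows (v_t, v_{t-1}, ..., v_{t-L}) for t = L+1, ..., n
E <- embed(v, L + 1)
y <- E[, 1]                  # response v_t
M <- E[, -1, drop = FALSE]   # lagged predictors v_{t-1}, ..., v_{t-L}
colnames(M) <- paste0("lag", 1:L)

# OLS fit of y on M: the AR(L) model
ar_fit <- lm(y ~ M)
round(coef(ar_fit), 2)
```

Adding the lagged Dow Jones return and log volatility columns to `M` would extend the same fit to all three features.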
Tune AR(5) with elastic net (lasso + ridge) regularization using all 3 features on the training data, and evaluate the test performance.
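For reference, the elastic net objective in the parameterization used by glmnet is written out below. This assumes glmnet is the fitting engine, which the text does not state. Here \(x_t\) collects the lagged features for day \(t\), \(\lambda \ge 0\) is the overall penalty strength (`penalty` in tidymodels) and \(\alpha \in [0, 1]\) is the lasso proportion (`mixture` in tidymodels), so \(\alpha = 1\) is the lasso and \(\alpha = 0\) is ridge. Both are the quantities being tuned. \[
\min_{\beta_0, \beta} \; \frac{1}{2n} \sum_{t=1}^{n} \left( v_t - \beta_0 - x_t^\top \beta \right)^2 + \lambda \left[ \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]
\]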
random_forest_fit %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  ggplot(mapping = aes(x = trees, y = mean, color = factor(mtry))) +
  geom_point() +
  labs(x = "Number of Trees", y = "RSQ")
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.222
2 rsq standard 0.174
3 mae standard 0.167
1.6 Summary (30 pts)
Your score for this question is largely determined by your final test performance.
Summarize the performance of different machine learning forecasters in the following format.
| Method | CV \(R^2\) | Test \(R^2\) |
|---|---|---|
| Baseline | NA | 0.35 |
| AR(5) | 0.26 | 0.22 |
| Random Forest | 0.20 | 0.19 |
| Boosting | 0.20 | 0.17 |
The baseline had a test \(R^2\) of 0.35. AR(5) gave a CV \(R^2\) of 0.26 and a test \(R^2\) of 0.22. Random forest had a CV \(R^2\) of 0.20 and a test \(R^2\) of 0.19. Boosting had a CV \(R^2\) of 0.20 and a test \(R^2\) of 0.17.
From this, the baseline method gave the best test performance of the four methods. However, none of these methods does a satisfactory job of predicting “log_volume”, and I would not rely on any of them for dependable predictions. Boosting had the lowest test \(R^2\), although its CV \(R^2\) was close to those of AR(5) and random forest.
2 ISL Exercise 12.6.13 (90 pts)
2.1 12.6.13 (b) (30 pts)
data = read_csv("../../slides/data/Ch12Ex13.csv", col_names = paste("ID", 1:40, sep = ""))
head(data)
linkage1 = hclust(as.dist(1 - cor(data)), method = "complete")
plot(linkage1, main = "Cluster Dendrogram with Complete Linkage")

linkage2 = hclust(as.dist(1 - cor(data)), method = "single")
plot(linkage2, main = "Cluster Dendrogram with Single Linkage")

linkage3 = hclust(as.dist(1 - cor(data)), method = "average")
plot(linkage3, main = "Cluster Dendrogram with Average Linkage")
Yes, the genes separate the samples into two groups for most of the linkages. The only dendrogram that appears to show three groups is the one with average linkage. The single and complete linkages both appear to separate the samples into two main groups. Judging from the dendrogram plots, complete linkage gives the cleanest separation.
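Reading group counts off a dendrogram is subjective, so one way to check the visual impression is to cut the tree and tabulate the labels. The sketch below does this on a simulated expression matrix. The data, the group structure and the sample names are all assumptions made for illustration, not the Ch12Ex13 data.

```r
# Simulated expression matrix: 200 genes (rows) x 10 samples (columns).
# Assumption: each group of 5 samples has its own block of 50 elevated genes.
set.seed(4)
X <- matrix(rnorm(200 * 10), nrow = 200)
X[51:100, 1:5] <- X[51:100, 1:5] + 3
X[1:50, 6:10]  <- X[1:50, 6:10] + 3
colnames(X) <- paste0("S", 1:10)

# Correlation-based distance between samples: 1 - correlation of the columns
d <- as.dist(1 - cor(X))
hc <- hclust(d, method = "complete")

# Cut the tree into two groups and compare with the known split
groups <- cutree(hc, k = 2)
table(groups, truth = rep(c("A", "B"), each = 5))
```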
2.2 PCA and UMAP (30 pts)
PCA:
pca_values = prcomp(data, scale = TRUE)

pca_recipe = recipe(~., data = data) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())
pca_prep = prep(pca_recipe)
For both UMAP and PCA, there appear to be two separate clusters, which I presume correspond to the “healthy” and “diseased” groups. One group also appears denser than the other.
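One detail worth keeping in mind is that `prcomp()` treats rows as observations. If the goal is one point per tissue sample and the samples are stored in columns, the matrix has to be transposed first. The sketch below illustrates this on simulated data. The matrix and its group shift are assumptions, not the Ch12Ex13 data.

```r
# Simulated expression matrix: 200 genes (rows) x 10 samples (columns).
# Assumption: the last 5 samples are shifted upward in the first 100 genes.
set.seed(5)
X <- matrix(rnorm(200 * 10), nrow = 200)
X[1:100, 6:10] <- X[1:100, 6:10] + 3

# Samples must be rows for prcomp(), so transpose the genes-by-samples matrix
pca <- prcomp(t(X), scale. = TRUE)

# Scores of the 10 samples on the first two principal components
scores <- pca$x[, 1:2]

# Proportion of variance explained by each component
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(pve[1:3], 2)
```

Plotting `scores` colored by group would show whether the first component separates the two sets of samples.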
One method to determine which genes differ the most across the two groups is multiple testing, as done above. Since we know that the first 20 columns are in the “healthy” group and the other 20 are in the “diseased” group, we can run a two-sample t-test on each gene (row) to get a p-value. To control the false discovery rate, the p-values are adjusted with the Benjamini-Hochberg method. Then the indices of all rows with an adjusted p-value below \(\alpha = 0.05\) are found. These are the genes that differ significantly between the two groups, and ranking them by adjusted p-value shows which differ the most.
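The procedure described above can be sketched on simulated data as follows. The matrix, the number of truly different genes and the size of the shift are assumptions for illustration only. The first 20 columns play the role of the “healthy” group and the last 20 the “diseased” group.

```r
# Simulated expression matrix: 500 genes (rows) x 40 samples (columns).
# Assumption: the first 25 genes are shifted in the "diseased" samples.
set.seed(6)
X <- matrix(rnorm(500 * 40), nrow = 500)
X[1:25, 21:40] <- X[1:25, 21:40] + 2.5
healthy  <- 1:20
diseased <- 21:40

# Two-sample t-test for each gene (row)
p_raw <- apply(X, 1, function(g) t.test(g[healthy], g[diseased])$p.value)

# Benjamini-Hochberg adjustment to control the false discovery rate
p_adj <- p.adjust(p_raw, method = "BH")

# Genes declared significant at the 0.05 level, ordered by adjusted p-value
sig <- which(p_adj < 0.05)
sig_ranked <- sig[order(p_adj[sig])]
length(sig)
```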